Online target-speech extraction method based on auxiliary function for robust automatic speech recognition

ABSTRACT

A target speech signal extraction method for robust speech recognition includes: initializing a steering vector for a target speech source and an adaptive vector, setting a real output channel of the target speech source as an output by the adaptive vector, initializing adaptive vectors for a noise and setting a dummy channel as an output by the adaptive vectors for the noise; setting a cost function for minimizing dependency between a real output for the target speech source and a dummy output for the noise; setting an auxiliary function to the cost function, and updating the adaptive vector for the target speech source and the adaptive vectors for the noise by using the auxiliary function and the steering vector; estimating the target speech signal by using the adaptive vector, thereby extracting the target speech signal from the input signals; and updating the steering vector for the target speech source.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a pre-processing method for target speech extraction in a speech recognition system or a speech recognition apparatus, and more particularly, to a target speech extraction method capable of reducing the calculation amount and improving speech recognition performance by performing independent component analysis using information on the direction of arrival of a target speech source or a steering vector for the target speech source.

2. Description of the Prior Art

With respect to an automatic speech recognition (ASR) system, since much noise exists in real environments, noise robustness is very important to maintain. In many cases, degradation in recognition performance of the speech recognition system is mainly caused by a difference between the learning environment and the real environment.

In general, in the speech recognition system, in a pre-processing step, a clear target speech signal, which is a speech signal of a target speaker, is extracted from input signals supplied through input means such as a plurality of microphones, and speech recognition is performed by using the extracted target speech signal. Various types of pre-processing methods of extracting the target speech signal from the input signals have been proposed for speech recognition systems.

In a speech recognition system using independent component analysis (ICA) of the related art, as many output signals are extracted as there are input signals, whose number corresponds to the number of microphones, and one target speech signal is selected from the output signals. In this case, in order to select the one target speech signal from the output signals, a process of identifying which direction each of the output signals is input from is required, and thus, there are problems in that the calculation amount is overloaded and the overall performance is degraded due to errors in the estimation of the input direction.

In a blind spatial subtraction array (BSSA) method of the related art, after the target speech signal output is removed, a noise power spectrum estimated by ICA using a projection-back method is subtracted. In this BSSA method, since the target speech signal output of the ICA still includes noise and the estimation of the noise power spectrum cannot be perfect, there is a problem in that the performance of the speech recognition is degraded.

On the other hand, in a semi-blind source estimation (SBSE) method of the related art, some preliminary information such as direction information is used for a source signal or a mixing environment. In this method, the known information is applied to the generation of a separating matrix for estimating the target signal, so that it is possible to separate the target speech signal more accurately. However, since this SBSE method requires an additional transformation of the input mixing vectors, there are problems in that the calculation amount is increased in comparison with other methods of the related art and the output cannot be correctly extracted in the case where the preliminary information includes errors.

On the other hand, in a real-time independent vector analysis (IVA) method of the related art, the permutation problem across frequency bins in the ICA is overcome by using a statistical model considering the correlation between frequencies. However, since one target speech signal still needs to be selected from the output signals, the same problems as in the ICA exist.

SUMMARY OF THE INVENTION

The present invention is to provide a method of accurately extracting a target speech signal with a reduced calculation amount by using information on the direction of arrival of a target speech source or a steering vector for the target speech source.

According to an aspect of the present invention, there is provided a target speech signal extraction method of extracting the target speech signal from input signals input to at least two or more microphones, the target speech signal extraction method including: (a) initializing a steering vector for a target speech source and an adaptive vector w₁(k) for the target speech source, setting a real output channel of the target speech source as an output by the adaptive vector for the target speech source, initializing adaptive vectors w₂(k), . . . , w_(M)(k) for a noise and setting a dummy channel as an output by the adaptive vectors for the noise; (b) setting a cost function for minimizing dependency between a real output for the target speech source and a dummy output for the noise using independent component analysis (ICA) or independent vector analysis (IVA); (c) setting an auxiliary function to the cost function, and updating the adaptive vector w₁(k) for the target speech source and the adaptive vectors w₂(k), . . . , w_(M)(k) for the noise by using the auxiliary function and the steering vector for the target speech source; (d) estimating the target speech signal by using the adaptive vector for the target speech source, thereby extracting the target speech signal from the input signals; and (e) updating the steering vector for the target speech source,

wherein steps (b) to (e) are executed repeatedly, and wherein the auxiliary function is set by an inequality relation so that the auxiliary function always has a value greater than or equal to that of the cost function.

In the target speech signal extraction method according to the above aspect, preferably, the step (e) includes: (e1) estimating a target mask which is defined as a value representing a ratio of a target speech signal power to a sum of the target speech signal power and a noise signal power; (e2) estimating a covariance matrix for the target speech source by using the estimated target mask; (e3) obtaining a principal eigenvector by eigenvector decomposition of the estimated covariance matrix; and (e4) estimating the steering vector from the principal eigenvector to update the steering vector.

In the target speech extraction method according to the present invention, in a speech recognition system, a target speech signal can be extracted from input signals by using information on the direction of arrival of the target speech, which can be supplied as preliminary information, and thus, the total calculation amount can be reduced in comparison with the extraction methods of the related art, so that the processing time can be reduced.

The present invention is also to provide a non-transitory computer readable storage media having program instructions that, when executed by a processor of a speech recognition apparatus or a speech recognition system, cause the processor to perform the target speech signal extraction method according to the above-mentioned aspect of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a configurational diagram illustrating a plurality of microphones and a target source in order to explain a target speech extraction method for robust speech recognition according to the present invention.

FIG. 2 is a flowchart sequentially illustrating an algorithm according to a target speech extraction method for robust speech recognition according to the present invention.

FIG. 3 is a table illustrating a comparison of the calculation amounts required for processing one data frame between a method according to the present invention and a real-time FD ICA method of the related art.

FIG. 4 is a configurational diagram illustrating a simulation environment configured in order to compare performance between the method according to the present invention and methods of the related art.

FIGS. 5A to 5I are graphs illustrating results of simulation of the method according to the present invention (referred to as ‘DC ICA’), a first method of the related art (referred to as ‘SBSE’), a second method of the related art (referred to as ‘BSSA’), and a third method of the related art (referred to as ‘RT IVA’) while adjusting the number of interference speech sources under the simulation environment of FIG. 4.

FIGS. 6A to 6I are graphs of results of simulation of the method according to the present invention (referred to as ‘DC ICA’), the first method of the related art (referred to as ‘SBSE’), the second method of the related art (referred to as ‘BSSA’), and the third method of the related art (referred to as ‘RT IVA’) by using various types of noise samples under the simulation environment of FIG. 4.

FIGS. 7A and 7B illustrate a subband clique and a harmonic clique as two typical clique cases.

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates to a target speech signal extraction method for robust speech recognition and a speech recognition pre-processing system employing the aforementioned target speech signal extraction method. Independent component analysis is performed under the assumption that the target speaker direction is known, or by using an estimated steering vector for the target speech source, so that the total calculation amount of speech recognition can be reduced and fast convergence can be achieved.

Hereinafter, a pre-processing method for robust speech recognition according to an exemplary embodiment of the present invention will be described in detail with reference to the attached drawings.

The present invention relates to a pre-processing method of a speech recognition system for extracting a target speech signal of a target speech source, that is, a target speaker, from input signals input to at least two or more microphones. The method includes: receiving information on a direction of arrival of the target speech source with respect to the microphones; generating a nullformer by using the information on the direction of arrival of the target speech source to remove the target speech signal from the input signals and to estimate noise; setting a real output of the target speech source using an adaptive vector w(k) as a first channel and setting a dummy output by the nullformer as the remaining channels; setting a cost function for minimizing dependency between the real output of the target speech source and the dummy output of the nullformer by performing independent component analysis (ICA); and estimating the target speech signal by using the cost function, thereby extracting the target speech signal from the input signals.

In a target speech signal extraction method according to the exemplary embodiment of the present invention, a target speaker direction is received as preliminary information, and a target speech signal that is a speech signal of a target speaker is extracted from signals input to a plurality of (M) microphones by using the preliminary information.

FIG. 1 is a configurational diagram illustrating a plurality of microphones and a target source in order to explain a target speech extraction method for robust speech recognition according to the present invention. Referring to FIG. 1, a plurality of microphones Mic.1, Mic.2, . . . , Mic.m, and Mic.M and a target speech source that is a target speaker are set. The target speaker direction, that is, the direction of arrival of the target speech source, is set as a separation angle θ_(target) between a vertical line in the front direction of the microphone array and the target speech source.

In FIG. 1, an input signal of an m-th microphone can be expressed by Mathematical Formula 1.

$$x_m(k,\tau) = \left[A(k)\right]_{m1} S_1(k,\tau) + \sum_{n=2}^{N} \left[A(k)\right]_{mn} S_n(k,\tau) \quad \text{[Mathematical Formula 1]}$$

Herein, k denotes a frequency bin number and τ denotes a frame number. S₁(k,τ) denotes a time-frequency segment of the target speech signal constituting the first channel, and S_(n)(k,τ) denotes a time-frequency segment of the remaining signals excluding the target speech signal, that is, noise estimation signals. A(k) denotes a mixing matrix in the k-th frequency bin.

In a speech recognition system, the target speech source is usually located near the microphones, and the acoustic paths between the speaker and the microphones have moderate reverberation components, which means that direct-path components are dominant. If the acoustic paths are approximated by the direct paths and the relative signal attenuation among the microphones is negligible, assuming proximity of the microphones without any obstacle, the ratio of the target speech source components in a pair of microphone signals can be obtained by using Mathematical Formula 2.

$$\frac{\left[A(k)\right]_{m1} S_1(k,\tau)}{\left[A(k)\right]_{m'1} S_1(k,\tau)} \approx \exp\left\{ j\omega_k \frac{d(m-m')\sin\theta_{target}}{c} \right\} \quad \text{[Mathematical Formula 2]}$$

Herein, θ_(target) denotes the direction of arrival (DOA) of the target speech source, d denotes the spacing between adjacent microphones, c denotes the speed of sound, and ω_k denotes the angular frequency of the k-th frequency bin. Therefore, a “delay-and-subtract nullformer”, that is, a nullformer for canceling out the target speech signal from the first and m-th microphones, can be expressed by Mathematical Formula 3.

$$Y_m(k,\tau) = X_m(k,\tau) - \exp\left\{ j\omega_k \frac{d(m-1)\sin\theta_{target}}{c} \right\} X_1(k,\tau), \quad m = 2, \ldots, M \quad \text{[Mathematical Formula 3]}$$

The nullformer obtains the relative ratio of the target speech components by using the information on the direction of arrival of the target speech source, multiplies this ratio by the input signal of one microphone of a pair, and subtracts the product from the input signal of the other microphone, so that the nullformer cancels out the target speech source component from the input signals.
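As an illustration, the delay-and-subtract nullformer of Mathematical Formula 3 can be sketched in a few lines of Python. This is a minimal sketch, not the implementation of the invention; the array shapes, the uniform-linear-array geometry, and the bin-to-frequency mapping ω_k = 2πf_s·k/N_FFT are assumptions.

```python
import numpy as np

def nullformer_outputs(X, theta_target, d, fs, c=343.0, n_fft=None):
    """Delay-and-subtract nullformer (Mathematical Formula 3).

    Assumed shapes: X is an (M, K, T) STFT tensor of the microphone
    signals, d is the spacing of a uniform linear array, fs the sampling
    rate, and c the speed of sound.
    """
    M, K, T = X.shape
    n_fft = n_fft or 2 * (K - 1)
    # Angular frequency of each bin (an assumed mapping).
    omega = 2.0 * np.pi * fs * np.arange(K) / n_fft
    Y = np.empty((M - 1, K, T), dtype=complex)
    for m in range(2, M + 1):
        # Relative target-speech ratio between microphone m and microphone 1.
        gamma = np.exp(1j * omega * d * (m - 1) * np.sin(theta_target) / c)
        # Subtract the phase-aligned reference channel to cancel the target.
        Y[m - 2] = X[m - 1] - gamma[:, None] * X[0]
    return Y
```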

In order to derive a learning rule, the nullformer outputs are regarded as dummy outputs, and the real target speech output is expressed by Mathematical Formula 4.

$$Y_1(k,\tau) = w_1^H(k)\, x(k,\tau) \quad \text{[Mathematical Formula 4]}$$

Herein, H denotes the Hermitian transpose and w₁(k) denotes the adaptive vector for generating the real output. Therefore, the real output and the dummy output can be expressed in a matrix form by Mathematical Formula 5.

$$y(k,\tau) = \begin{bmatrix} w_1^H(k) \\ -\gamma_k \mid I \end{bmatrix} x(k,\tau) \quad \text{[Mathematical Formula 5]}$$

Herein, γ_k = [Γ_(k)¹, …, Γ_(k)^(M−1)]^(T), Γ_(k) = exp{jω_(k) d sin θ_(target)/c}, and I denotes the (M−1)×(M−1) identity matrix, so that the lower block [−γ_k | I] stacks the nullformer rows of Mathematical Formula 3.

The nullformer parameters for generating the dummy output are fixed to provide the noise estimation. As a result, according to the present invention, the permutation problem over the frequency bins can be solved. Unlike an IVA method, the estimation of w₁(k) at each frequency bin independently of other frequency bins can provide fast convergence, so that it is possible to improve the performance of target speech signal extraction as pre-processing for the speech recognition system.

Therefore, according to the present invention, by maximizing the independence between the real output and the dummy output at each frequency bin, it is possible to obtain the desired target speech signal from the real output.

According to the present invention, a method of estimating a steering vector h(k) for the target speech source can be proposed. Similarly to Mathematical Formula 1, S₁(k,τ) and S_(m)(k,τ), m≥2, denote a target speech source and a noise, respectively. A target mask ℳ(k,τ) is defined as a ratio of the power of the target speech source to the power of the input signal x(k,τ) in which the target speech source and the noise are mixed.

$$\mathcal{M}(k,\tau) = \frac{\left|S_1(k,\tau)\right|^2}{\sum_{m=1}^{M}\left|S_m(k,\tau)\right|^2}$$

Here, the target speech source and the noise can be estimated from the output signal which is calculated by Mathematical Formula 5 using the real output channel and the dummy channel.

$$\mathrm{diag}\left(W^{-1}(k)\right) y(k,\tau) = \begin{bmatrix} \hat{S}_1(k,\tau) \\ \vdots \\ \hat{S}_M(k,\tau) \end{bmatrix}$$

Here, diag(W⁻¹(k)) calibrates the scale of the output signal and is obtained by calculating the inverse matrix W⁻¹(k) of the demixing matrix and taking the diagonal matrix diag(⋅) with the off-diagonal elements of the inverse matrix set to zero.

Therefore, the target mask can be estimated as follows.

$$\hat{\mathcal{M}}(k,\tau) = \frac{\left|\hat{S}_1(k,\tau)\right|^2}{\sum_{m=1}^{M}\left|\hat{S}_m(k,\tau)\right|^2}$$

The covariance matrix for the target speech source can be estimated byusing the estimated target mask.

$$\hat{R}_s(k) = \frac{1}{\sum_{\tau}\hat{\mathcal{M}}(k,\tau)} \sum_{\tau}\hat{\mathcal{M}}(k,\tau)\, x(k,\tau)\, x^H(k,\tau)$$

Therefore, the steering vector for the target speech source can be estimated as the principal eigenvector obtained by eigenvalue decomposition of the estimated target spatial covariance matrix.

In this way, the steering vector can be estimated directly and stably without estimating the direction of the target speech source.
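For illustration, the batch version of this mask-based steering-vector estimation (steps (e1) to (e4)) can be sketched as follows. This is a minimal sketch under assumed shapes, not the implementation of the invention: S_hat holds the scale-calibrated output estimates and X the microphone STFTs.

```python
import numpy as np

def estimate_steering_vector(S_hat, X, eps=1e-12):
    """Mask-based steering-vector estimate (steps (e1)-(e4)).

    Assumed shapes: S_hat and X are (M, K, T) arrays; returns h of
    shape (M, K), one unit-norm steering vector per frequency bin.
    """
    M, K, T = X.shape
    power = np.abs(S_hat) ** 2
    # (e1) Target mask: target power over total output power per bin/frame.
    mask = power[0] / np.maximum(power.sum(axis=0), eps)      # (K, T)
    h = np.empty((M, K), dtype=complex)
    for k in range(K):
        Xk = X[:, k, :]                                       # (M, T)
        # (e2) Mask-weighted spatial covariance of the inputs.
        R = (Xk * mask[k]) @ Xk.conj().T / np.maximum(mask[k].sum(), eps)
        # (e3)-(e4) Principal eigenvector as the steering-vector estimate.
        _, vecs = np.linalg.eigh(R)                           # eigenvalues ascending
        h[:, k] = vecs[:, -1]
    return h
```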

The steering vector can also be estimated on-line by updating h(k;τ) in real time.

$$\mathrm{diag}\left(A(k;\tau)\right) y(k,\tau) = \begin{bmatrix} \hat{S}_1(k,\tau) \\ \vdots \\ \hat{S}_M(k,\tau) \end{bmatrix}$$

Here, diag(A(k;τ)) calibrates the scale of the output signal and can use the inverse matrix A(k;τ) of the demixing matrix, which is updated continuously. Therefore, the target mask can be estimated by using the following equation.

$$\hat{\mathcal{M}}(k,\tau) = \frac{\left|\hat{S}_1(k,\tau)\right|^2}{\sum_{m=1}^{M}\left|\hat{S}_m(k,\tau)\right|^2}$$

The covariance matrix for the target speech source can be updated in real time by using the estimated target mask, a smoothing factor α, and the moving average.

$$\hat{R}_s(k;\tau) = \alpha\, \hat{R}_s(k;\tau-1) + (1-\alpha)\, \hat{\mathcal{M}}(k,\tau)\, x(k,\tau)\, x^H(k,\tau)$$

Therefore, the steering vector for the target speech source can be estimated on-line and in real time as the principal eigenvector obtained by eigenvalue decomposition of the estimated target spatial covariance matrix.
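A recursive form of this on-line update might look as follows. This is a sketch; the smoothing-factor value and the identity initialization of the covariance are assumptions.

```python
import numpy as np

class OnlineSteeringVector:
    """Tracks the steering vector per bin via the moving-average covariance."""

    def __init__(self, M, K, alpha=0.98):
        self.alpha = alpha                                   # smoothing factor
        self.R = np.stack([np.eye(M, dtype=complex)] * K)    # (K, M, M)

    def update(self, x, mask):
        """x: (M, K) input STFT frame; mask: (K,) estimated target mask."""
        for k in range(x.shape[1]):
            outer = np.outer(x[:, k], x[:, k].conj())
            self.R[k] = self.alpha * self.R[k] + (1 - self.alpha) * mask[k] * outer
        # Principal eigenvector of each bin's covariance (eigenvalues ascending).
        _, vecs = np.linalg.eigh(self.R)
        return vecs[:, :, -1].T                              # (M, K)
```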

With respect to the cost function, by taking the Kullback-Leibler (KL) divergence between the probability density functions p(Y₁(k,τ), Y₂(k,τ), …, Y_(M)(k,τ)) and q(Y₁(k,τ))p(Y₂(k,τ), …, Y_(M)(k,τ)) and removing the terms independent of w₁(k), the cost function of the target speech extraction method based on independent component analysis (DC ICA) can be expressed by Mathematical Formula 6.

$$I'_{ICA}(W) = -\sum_{k=1}^{K}\left\{ \log\left|\sum_{m=1}^{M}\Gamma_k^{m-1}\left[w_1^H(k)\right]_m\right| + E\left[\log q\left(Y_1(k,\tau)\right)\right] \right\} \quad \text{[Mathematical Formula 6]}$$

Herein, [·]_(m) denotes the m-th element of a vector.

In addition, a target speech extraction method based on independent vector analysis (DC IVA) may also be considered, in which the cost function is set as follows by using ŷ_(m)(τ)=[Y_(m)(1,τ), Y_(m)(2,τ), …, Y_(m)(K,τ)]^(T) in consideration of the dependency between frequency components.

$$I'_{IVA}(W) = -\log\left|\sum_{m=1}^{M}\Gamma_k^{m-1}\left[w_1^H(k)\right]_m\right| - E\left[\log q\left(\hat{y}_1(\tau)\right)\right] \quad \text{[Mathematical Formula 7]}$$

Because the above-mentioned DC ICA and DC IVA methods converge by using a gradient algorithm, there is a trade-off between the convergence speed, determined by the step size of the learning rate, and the stability.

On the other hand, the target speech extraction method using an auxiliary function sets the auxiliary function Q and is able to minimize the cost quickly without the need to suitably set the step size of a learning rate. The auxiliary function Q may be set by an inequality relation using the inequality

$$G_R(r) \leq \frac{G'_R(r_0)}{2 r_0} r^2 + G_R(r_0) - \frac{r_0\, G'_R(r_0)}{2}$$

so that the auxiliary function always has a value greater than or equal to that of the objective function.

Accordingly, through the optimization of the auxiliary function, the objective function, which always has a value smaller than or equal to that of the auxiliary function, is optimized along with it.

In the DC ICA method, the relationship between the auxiliary functionand the objective function is as follows.

$$I'_{ICA} \leq Q = E\left[\sum_{k=1}^{K}\frac{G'\left(r_1(k,\tau)\right)}{2 r_1(k,\tau)}\left|Y_1(k,\tau)\right|^2\right] - \log\left|\sum_{m=1}^{M}\Gamma_k^{m-1}\left[w_1^H(k)\right]_m\right| + R = \frac{1}{2} w_1^H(k)\, V_1(k)\, w_1(k) - \log\left|\sum_{m=1}^{M}\Gamma_k^{m-1}\left[w_1^H(k)\right]_m\right| + R \quad \text{[Mathematical Formula 8]}$$

Here, R is a constant irrelevant to the estimated vector w₁(k). In the DC ICA method, V_(m)(k) is as follows.

$$V_m(k) = E\left[\frac{G'\left(r_m(k,\tau)\right)}{r_m(k,\tau)}\, x(k,\tau)\, x^H(k,\tau)\right] \quad \text{[Mathematical Formula 9]}$$

Here, r_(m)(k,τ)=|Y_(m)(k,τ)|=|w_(m)^(H)(k)x(k,τ)|, and G′(r_(m)(k,τ)) denotes the differentiation of G(r_(m)(k,τ))=−log q(Y_(m)(k,τ)) with respect to r_(m)(k,τ).

Meanwhile, the probability density functions of various speech sources in the time-frequency domain can be expressed by modeling p(Y_(m)(k,τ)) as a probability density function following the generalized Gaussian distribution expressed by Mathematical Formula 10.

$$p\left(Y_m(k,\tau)\right) \propto \frac{1}{\lambda_m(k,\tau)} \exp\left\{ -\left(\frac{\left|Y_m(k,\tau)\right|^2}{\lambda_m(k,\tau)}\right)^{\beta} \right\} \quad \text{[Mathematical Formula 10]}$$

Herein, λ_(m)(k,τ) and β are the variance and the shape parameter, respectively, and the type of pdf can be determined according to the values of these parameters. For example, the pdf is a Laplace distribution with a unit variance if β = 1/2 and λ_(m)(k,τ) = 1, and a Gaussian distribution if β = 1.

In addition, in the DC IVA method, an extended probability density function using cliques may be used as follows.

$$p\left(\hat{y}_m(\tau)\right) \propto \frac{1}{\prod_{c=1}^{N_c}\lambda_m(c,\tau)} \exp\left\{ -\sum_{c=1}^{N_c}\left(\frac{\sum_{k\in\Omega_c}\left|Y_m(k,\tau)\right|^2}{\lambda_m(c,\tau)}\right)^{\beta} \right\} \quad \text{[Mathematical Formula 11]}$$

Herein, c, N_(c), and Ω_(c) are a clique index, the number of cliques, and the set of frequency bins included in the corresponding clique, respectively. Also, λ_(m)(c,τ) and β are the variance and the shape parameter of the c-th clique in the m-th output, respectively. Probability density functions of various kinds can be expressed according to the design of the clique structure and the setting of the shape parameter and the variance.

Also, in the clique-based DC IVA method, the relationship between the objective function and the auxiliary function is as follows.

$$I'_{IVA} \leq Q = E\left[\sum_{c=1}^{N_c}\left(\frac{\sum_{k\in\Omega_c}\left|Y_1(k,\tau)\right|^2}{\lambda_1^{\beta}(c,\tau)}\right)^{\beta}\right] - \log\left|\sum_{m=1}^{M}\Gamma_k^{m-1}\left[w_1^H(k)\right]_m\right| + R = \frac{1}{2} w_1^H(k)\, V_1(k)\, w_1(k) - \log\left|\sum_{m=1}^{M}\Gamma_k^{m-1}\left[w_1^H(k)\right]_m\right| + R \quad \text{[Mathematical Formula 12]}$$

Therefore, V_(m)(k) is given as follows.

$$V_m(k) = E\left[\sum_{c=1}^{N_c}\frac{\left(\sum_{k\in\Omega_c}\left|Y_m(k,\tau)\right|^2\right)^{\beta-1}}{\lambda_m^{\beta}(c,\tau)}\, x(k,\tau)\, x^H(k,\tau)\right] \quad \text{[Mathematical Formula 13]}$$

To optimize the auxiliary function Q with respect to w₁(k), w₁(k) satisfying Mathematical Formula 14 is obtained.

$$\frac{\partial}{\partial w_1^*(k)} Q = \frac{1}{2} V_1(k)\, w_1(k) - \frac{\partial}{\partial w_1^*(k)}\log\left|\sum_{m=1}^{M}\Gamma_k^{m-1}\left[w_1^H(k)\right]_m\right| = 0 \quad \text{[Mathematical Formula 14]}$$

Therefore, w₁(k) satisfies Mathematical Formula 15.

$$W(k)\, V_1(k)\, w_1(k) = e_1 \quad \text{[Mathematical Formula 15]}$$

Herein, e_(m) is a vector of which only the m-th component is 1 and the other components are all 0. Therefore, w₁(k) satisfies the following Mathematical Formula 16.

$$w_1(k) = \left(W(k)\, V_1(k)\right)^{-1} e_1 \quad \text{[Mathematical Formula 16]}$$

w₁(k) may be normalized according to Mathematical Formula 17 or 18 as follows.

$$w_1(k) = \frac{w_1(k)}{\sqrt{w_1^H(k)\, V_1(k)\, w_1(k)}} \quad \text{[Mathematical Formula 17]}$$

$$w_1(k) = \frac{w_1(k)}{\sqrt{w_1^H(k)\, w_1(k)}} \quad \text{[Mathematical Formula 18]}$$
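As an illustration, one auxiliary-function iteration for w₁(k) at a single frequency bin (Mathematical Formulas 9, 16, and 17) can be sketched as follows. This is a sketch under assumptions, not the implementation of the invention: a unit-variance generalized Gaussian prior is assumed, for which G(r) = r^(2β) and hence G′(r)/r = 2βr^(2β−2), and the expectation is replaced by a frame average.

```python
import numpy as np

def update_w1(X, w1, W, beta=0.5, eps=1e-12):
    """One auxiliary-function update of w1(k) for one frequency bin.

    Assumed shapes: X is the (M, T) input STFT at bin k, W the current
    (M, M) demixing matrix whose first row is w1^H, w1 the current
    adaptive vector.
    """
    M, T = X.shape
    r = np.maximum(np.abs(w1.conj() @ X), eps)       # r_1(k, tau) = |Y_1|
    phi = 2.0 * beta * r ** (2.0 * beta - 2.0)       # G'(r)/r for G(r) = r**(2*beta)
    V1 = (X * phi) @ X.conj().T / T                  # Mathematical Formula 9
    w1 = np.linalg.solve(W @ V1, np.eye(M)[:, 0])    # Mathematical Formula 16
    # Normalization of Mathematical Formula 17.
    return w1 / np.sqrt(np.real(w1.conj() @ V1 @ w1) + eps)
```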

In addition, supposing that λ_(m)(k,τ) is a variance varying according to time or frequency, various methods for estimating λ_(m)(k,τ) can be proposed.

For example,

$$\lambda_m(k,\tau) = \beta^{\frac{1}{\beta}}\left|Y_m(k,\tau)\right|^2$$

can be estimated through the likelihood maximization algorithm, or

$$\lambda_m(k,\tau) = \beta\left(\frac{\left|Y_m(k,\tau)\right|^2}{\lambda_m(k,\tau)}\right)^{\beta-1}\left|Y_m(k,\tau)\right|^2$$

can be estimated through the recursive estimation method, where λ_(m)(k,τ) on the right side of the equation can be replaced by λ_(m)(k,τ−1).

In particular,

$$\lambda_m(k,\tau) = \frac{1}{4}\left|Y_m(k,\tau)\right|^2$$

in the Laplace distribution with β = 1/2, and λ_(m)(k,τ)=|Y_(m)(k,τ)|² in the Gaussian distribution with β = 1.

Also, λ_(m)(k,τ) can be estimated by the equation

$$\lambda_m(k,\tau) = \frac{\beta^{\frac{1}{\beta}}}{2N_a+1}\sum_{\tau'=\tau-N_a}^{\tau+N_a}\left|Y_m(k,\tau')\right|^2,$$

which considers the values of the adjacent time frames by including N_(a) frames before and after the current time frame τ, respectively, or λ_(m)(k,τ) can be estimated by the equation

$$\lambda_m(k,\tau) = \frac{\beta}{2N_a+1}\sum_{\tau'=\tau-N_a}^{\tau+N_a}\left[\frac{\left|w_m^H(k)\, x(k,\tau')\right|^2}{\lambda_m(k,\tau')}\right]^{\beta-1}\left|w_m^H(k)\, x(k,\tau')\right|^2$$

through the recursive estimation method, where λ_(m)(k,τ) on the right side of the equation can be replaced by λ_(m)(k,τ−1).

Similarly, λ_(m)(k,τ) can be estimated by the recursive equation λ_(m)(k,τ)=γ|Y_(m)(k,τ)|²+(1−γ)λ_(m)(k,τ−1), where γ denotes a smoothing parameter. Therefore, λ₁(k,τ) and w₁(k) are updated alternately and repeatedly until convergence.
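For illustration, the last recursive variance estimate can be sketched as follows; the value of the smoothing parameter γ and the initial variance are assumptions.

```python
import numpy as np

def smooth_variance(Y1, gamma=0.9, lam0=1.0):
    """Recursive track lambda(k,tau) = gamma*|Y1(k,tau)|**2 + (1-gamma)*lambda(k,tau-1).

    Y1 is the (T,) output sequence of one frequency bin; lam0 initializes
    lambda(k, 0). Returns the (T,) variance track.
    """
    lam = np.empty(Y1.shape[0])
    prev = lam0
    for t in range(Y1.shape[0]):
        prev = gamma * np.abs(Y1[t]) ** 2 + (1.0 - gamma) * prev
        lam[t] = prev
    return lam
```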

In addition, since Y₁(k,τ)=w₁^(H)(k)x(k,τ), λ₁(k,τ) can be initialized by various methods, including the above-mentioned methods, through the initial value of w₁(k). If the initial value of w₁(k) is set to a unit vector of which the m-th element is 1, the initial value of λ₁(k,τ) can be measured by the power of the input signal of an individual microphone. In another method, λ₁(k,τ) can be measured approximately by an enhanced signal power through a beamformer using the DOA.

Also, in the clique-based method, supposing that λ_(m)(c,τ) is a variance varying according to time or clique, various methods for estimating λ_(m)(c,τ) can be proposed in the same way.

For example,

$$\lambda_m(c,\tau) = \left(\frac{\beta}{N_c}\right)^{\frac{1}{\beta}}\sum_{k\in\Omega_c}\left|Y_m(k,\tau)\right|^2$$

can be estimated through the maximum likelihood method, or

$$\lambda_m(c,\tau) = \frac{1}{N_c}\left(\frac{\sum_{k\in\Omega_c}\left|Y_m(k,\tau)\right|^2}{\lambda_m(c,\tau)}\right)^{\beta-1}\sum_{k\in\Omega_c}\left|Y_m(k,\tau)\right|^2$$

can be estimated through the recursive estimation method, where λ_(m)(c,τ) on the right side of the equation can be replaced by λ_(m)(c,τ−1).

Also, in order to update the steering vector, the noise signal calculated in the dummy channel needs to be updated accurately. Therefore, Mathematical Formula 5 is replaced as follows:

$$y(k,\tau) = W(k)\, x(k,\tau) = \begin{bmatrix} w_1^H(k) \\ w_2^H(k) \\ \vdots \\ w_M^H(k) \end{bmatrix} x(k,\tau).$$

Therefore, the output of the dummy channel is expressed by the adaptive vectors (w₂(k), . . . , w_(M)(k)) for the noise output, which are capable of learning. The target mask for updating the steering vector can be estimated effectively by using the noise output in the dummy channel in addition to the real output in the target channel. In this case, the auxiliary function Q of Mathematical Formula 12 is changed as follows:

$$Q = \sum_{m=1}^{M} E\left[\sum_{c=1}^{N_c}\left(\frac{\sum_{k\in\Omega_c}\left|Y_m(k,\tau)\right|^2}{\lambda_m^{\beta}(c,\tau)}\right)^{\beta}\right] - \log\left|\det W(k)\right| + R = \frac{1}{2}\sum_{m=1}^{M} w_m^H(k)\, V_m(k)\, w_m(k) - \log\left|\det W(k)\right| + R$$

In addition, in the target speech extraction methods (DC ICA and DC IVA) using the auxiliary function, if the auxiliary function is set with a distortionless constraint, under which the input and output signals coming from the direction of the target speech obtained by using the prior information are identical with each other, the scaling indeterminacy of the estimated signal, which arises when a parameter is updated in the conventional method, fundamentally does not occur.

Therefore, a signal having smaller distortion than in the conventional method can be obtained without applying the minimum distortion principle (MDP) used to resolve the scaling indeterminacy problem in the conventional method.

If the distortionless constraint w₁^(H)(k)h(k)=1 and the nullforming constraint w_(m)^(H)(k)h(k)=0 (m≠1) are added to the auxiliary function Q, the extended auxiliary function Q′ is given as follows.

$$Q' = \sum_{m=1}^{M}\left[\frac{1}{2} w_m^H(k)\, V_m(k)\, w_m(k) + \alpha\left(w_m^H(k)\, h(k) - \beta_m\right)\right] - \log\left|\det W(k)\right| + R \quad \text{[Mathematical Formula 19]}$$

Here, h(k) is the steering vector [Γ_(k)⁰, Γ_(k)¹, . . . , Γ_(k)^(M-1)]^(T) toward the target speech, and β_(m) is 1 for m=1 and 0 for m≠1. To minimize the above extended auxiliary function Q′, the estimation equation of w₁(k) is given as follows.

$$w_1(k) = \frac{\left(W(k)\, V_1(k)\right)^{-1} e_1}{h^H(k)\left(W(k)\, V_1(k)\right)^{-1} e_1} \quad \text{[Mathematical Formula 20]}$$

Until now, a method in which only w₁(k) is estimated repeatedly by using the above equation and the remaining w_(m)(k) are fixed to the nullformer has been described. However, for performance improvement, the remaining w_(m)(k), m≠1, may also be estimated repeatedly by using Mathematical Formula 21.

$$w_m(k) = \left(W(k)\, V_m(k)\right)^{-1} e_m \quad \text{[Mathematical Formula 21]}$$

$$w_m(k) = \frac{w_m(k)}{\sqrt{w_m^H(k)\, V_m(k)\, w_m(k)}} \quad \text{or} \quad w_m(k) = \frac{w_m(k)}{\sqrt{w_m^H(k)\, w_m(k)}}$$

By applying the distortionless constraint and the nullforming constraint, the following equation is satisfied.

$$W(k)\, h(k) = e_1$$

When A(k)=W⁻¹(k), A(k)e₁=h(k) is given.

By using these equations, Mathematical Formula 20 can be rearranged as follows:

$$w_1(k) \leftarrow \frac{\left(W(k)\, V_1(k)\right)^{-1} e_1}{h^H(k)\left(W(k)\, V_1(k)\right)^{-1} e_1} = \frac{V_1^{-1}(k)\, h(k)}{h^H(k)\, V_1^{-1}(k)\, h(k)}.$$

Mathematical Formula 21, which expresses the conventional demixing matrix update for the dummy channels, does not reflect the Lagrange multiplier. Therefore, by reflecting the Lagrange multiplier to apply the nullforming constraint, Mathematical Formula 21 can be rearranged as follows:

$$w_m(k) \leftarrow \left(W(k)\, V_m(k)\right)^{-1} e_m - \frac{h^H(k)\left(W(k)\, V_m(k)\right)^{-1} e_m}{h^H(k)\left(W(k)\, V_m(k)\right)^{-1} e_1}\left(W(k)\, V_m(k)\right)^{-1} e_1, \quad 2 \leq m \leq M.$$

Therefore, the equations are rearranged as follows:

$$u_m(k) = \left(W(k)\, V_m(k)\right)^{-1} e_m$$

$$\hat{u}_m(k) = \frac{V_m^{-1}(k)\, h(k)}{h^H(k)\, V_m^{-1}(k)\, h(k)}$$

$$w_m(k) \leftarrow u_m(k) - \left(h^H(k)\, u_m(k)\right)\hat{u}_m(k).$$
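For illustration, the constrained batch update of all channels at one frequency bin can be sketched as follows. This is a sketch under assumptions, not the implementation of the invention: the rows of W are taken to be w_m^H, V stacks the weighted covariance matrices V_m(k), and h is the current steering-vector estimate.

```python
import numpy as np

def constrained_update(W, V, h, eps=1e-12):
    """Distortionless/nullforming-constrained update at one bin.

    W: (M, M) demixing matrix with rows w_m^H; V: (M, M, M) stack of
    V_m(k); h: (M,) steering vector. Returns the updated demixing matrix.
    """
    M = W.shape[0]
    W_new = W.copy()
    # Target channel: w1 = V1^{-1} h / (h^H V1^{-1} h) (rearranged Formula 20).
    v = np.linalg.solve(V[0], h)
    W_new[0] = (v / (h.conj() @ v + eps)).conj()
    e = np.eye(M, dtype=complex)
    for m in range(1, M):
        # u_m = (W V_m)^{-1} e_m and its steering-direction counterpart.
        u = np.linalg.solve(W @ V[m], e[:, m])
        u_hat = np.linalg.solve(V[m], h)
        u_hat = u_hat / (h.conj() @ u_hat + eps)
        # Project out the steering direction (nullforming constraint).
        W_new[m] = (u - (h.conj() @ u) * u_hat).conj()
    return W_new
```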

To derive the real-time target speech extraction method based on the auxiliary function, if the inverse matrix of V_(m)(k) in time frame τ is denoted by U_(m)(k;τ) and the inverse matrix of W(k) is denoted by A(k;τ), U_(m)(k;τ) can be obtained recursively from U_(m)(k;τ−1) as follows.

$$U_m(k;\tau) = \frac{1}{\alpha}\left(U_m(k;\tau-1) - \frac{p_m(k,\tau)\, U_m(k;\tau-1)\, x(k,\tau)\, x^H(k,\tau)\, U_m^H(k;\tau-1)}{\alpha + p_m(k,\tau)\, x^H(k,\tau)\, U_m^H(k;\tau-1)\, x(k,\tau)}\right) \quad \text{[Mathematical Formula 22]}$$

$$A(k;\tau) = A(k;\tau) - \frac{A(k;\tau)\, e_m\, \Delta w_m^H(k;\tau)\, A(k;\tau)}{1 + \Delta w_m^H(k;\tau)\, A(k;\tau)\, e_m}$$

Here, w_(m)(k) in the time frame τ is denoted by w_(m)(k;τ). In addition,

$$p_m(k,\tau) = (1-\alpha)\frac{G'\left(r_m(k,\tau)\right)}{r_m(k,\tau)} \quad \text{and} \quad \Delta w_m(k;\tau) = w_m(k;\tau) - w_m(k;\tau-1),$$

using the forgetting factor α. Therefore, w₁(k;τ) can be estimated as follows.

$$w_1(k;\tau) = U_1(k;\tau)\, A(k;\tau)\, e_1 \quad \text{[Mathematical Formula 23]}$$

As occasion demands, the normalization is performed.
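As an illustration, one on-line step of Mathematical Formulas 22 and 23 for a single channel and frequency bin can be sketched as follows; the forgetting-factor value and the Hermitian symmetry of U are assumptions.

```python
import numpy as np

def recursive_update(x, U, A, w_prev, m, G_over_r, alpha=0.96):
    """One real-time step of Mathematical Formulas 22-23 for channel m.

    Assumed shapes: x (M,) input frame at bin k, U (M, M) running inverse
    of V_m, A (M, M) inverse of the demixing matrix W, w_prev the previous
    w_m(k; tau-1); G_over_r is G'(r_m)/r_m for the current frame.
    """
    p = (1.0 - alpha) * G_over_r
    # Recursive inverse update of U_m(k; tau) (first line of Formula 22).
    Ux = U @ x
    U = (U - p * np.outer(Ux, Ux.conj()) / (alpha + p * (Ux.conj() @ x))) / alpha
    # w_m(k; tau) = U_m(k; tau) A(k; tau) e_m (Mathematical Formula 23).
    w_new = U @ A[:, m]
    # Rank-one refresh of A = W^{-1} (second line of Formula 22).
    dw = w_new - w_prev
    A = A - np.outer(A[:, m], dw.conj() @ A) / (1.0 + dw.conj() @ A[:, m])
    return w_new, U, A
```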

In order to resolve the scaling indeterminacy of the output signal by applying the minimal distortion principle (MDP) to the output Y₁(k,τ)=w₁^(H)(k;τ)x(k,τ) obtained by using w₁(k;τ), the diagonal elements of the inverse matrix of the separating matrix need to be obtained.

Due to the structural features, the inverse matrix

$$\begin{bmatrix} w_1^H(k;\tau) \\ -\gamma_k \mid I \end{bmatrix}^{-1}$$

of the above-described matrix can be simply obtained by calculating only the factor

$$1 \Big/ \sum_{m=1}^{M}\Gamma_k^{m-1}\left[w_1^H(k;\tau)\right]_m$$

for the target output and multiplying the factor by the output.

In the real-time target speech extraction method in which the distortionless constraint and the nullforming constraint are added to the auxiliary function, the update equation of Mathematical Formula 23 is replaced by Mathematical Formula 24.

$$w_1(k;\tau) = \frac{U_1(k;\tau)\, A(k;\tau)\, e_1}{h^H(k)\, U_1(k;\tau)\, A(k;\tau)\, e_1} \quad \text{[Mathematical Formula 24]}$$

If w_(m)(k;τ), m≠1, is also estimated repeatedly, the following Mathematical Formula 25 is used.

$$w_m(k;\tau) = U_m(k;\tau)\, A(k;\tau)\, e_m \quad \text{[Mathematical Formula 25]}$$

$$w_m(k;\tau) = \frac{w_m(k;\tau)}{\sqrt{w_m^H(k;\tau)\, V_m(k;\tau)\, w_m(k;\tau)}} \quad \text{or} \quad w_m(k;\tau) = \frac{w_m(k;\tau)}{\sqrt{w_m^H(k;\tau)\, w_m(k;\tau)}}$$

Similarly, by rearranging Mathematical Formula 24, the adaptive vector (m=1) for the target channel can be expressed as follows:

$$w_1(k;\tau) \leftarrow \frac{U_1(k;\tau)\, h(k;\tau)}{h^H(k;\tau)\, U_1(k;\tau)\, h(k;\tau)}.$$

By reflecting the Lagrange multiplier and rearranging Mathematical Formula 25, the adaptive vectors (2≤m≤M) for the noise channels can be expressed as follows:

$$u_m(k;\tau) = U_m(k;\tau)\, A(k;\tau)\, e_m$$

$$\hat{u}_m(k;\tau) = \frac{U_m(k;\tau)\, h(k;\tau)}{h^H(k;\tau)\, U_m(k;\tau)\, h(k;\tau)}$$

$$w_m(k;\tau) \leftarrow u_m(k;\tau) - \left(h^H(k;\tau)\, u_m(k;\tau)\right)\hat{u}_m(k;\tau).$$

Next, a time domain waveform of the estimated target speech signal canbe reconstructed by Mathematical Formula 26.

$$y(t) = \sum_{\tau}\sum_{k=1}^{K} Y_1(k,\tau)\, e^{j\omega_k(t-\tau H)} \quad \text{[Mathematical Formula 26]}$$
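For illustration, an overlap-add resynthesis in the spirit of Mathematical Formula 26 can be sketched as follows. This is a sketch under assumptions: Y1 is a one-sided STFT, H (here `hop`) is the frame shift in samples, and the rectangular synthesis window is a placeholder choice.

```python
import numpy as np

def reconstruct_waveform(Y1, hop, n_fft=None, window=None):
    """Overlap-add reconstruction of the estimated target waveform.

    Y1: (K, T) one-sided STFT of the real output (K = n_fft // 2 + 1);
    hop: frame shift H in samples. Returns the time-domain signal y(t).
    """
    K, T = Y1.shape
    n_fft = n_fft or 2 * (K - 1)
    window = window if window is not None else np.ones(n_fft)
    y = np.zeros(hop * (T - 1) + n_fft)
    for t in range(T):
        frame = np.fft.irfft(Y1[:, t], n=n_fft)   # inverse DFT of one frame
        y[t * hop : t * hop + n_fft] += window * frame
    return y
```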

FIG. 2 is a flowchart sequentially illustrating the algorithm of the target speech extraction method according to the present invention.

FIG. 3 is a table illustrating a comparison of the calculation amounts required for calculating values of the first column of one data frame between a method according to the present invention and a real-time FD ICA method of the related art.

In FIG. 3, M denotes the number of input signals, that is, the number of microphones. K denotes the frequency resolution, that is, the number of frequency bins. O(M) and O(M³) denote calculation amounts with respect to matrix inversion. It can be understood from FIG. 3 that the method of the related art requires more additional computations than the method according to the present invention in order to resolve the permutation problem and to identify the target speech output.

FIG. 4 is a configurational diagram illustrating a simulation environment configured in order to compare performance between the method according to the present invention and methods of the related art. Referring to FIG. 4, there is a room having a size of 3 m×4 m where two microphones Mic.1 and Mic.2 and a target speech source T are provided, and three interference speech sources Interference 1, Interference 2, and Interference 3 are provided. FIGS. 5A to 5I are graphs of results of simulation of the method according to the present invention (referred to as ‘DC ICA’), a first method of the related art (referred to as ‘SBSE’), a second method of the related art (referred to as ‘BSSA’), and a third method of the related art (referred to as ‘RT IVA’) while adjusting the number of interference speech sources under the simulation environment of FIG. 4. FIG. 5A illustrates a case where there is one interference speech source Interference 1 and RT₆₀=0.2 s. FIG. 5B illustrates a case where there is one interference speech source Interference 1 and RT₆₀=0.4 s. FIG. 5C illustrates a case where there is one interference speech source Interference 1 and RT₆₀=0.6 s. FIG. 5D illustrates a case where there are two interference speech sources Interference 1 and Interference 2 and RT₆₀=0.2 s. FIG. 5E illustrates a case where there are two interference speech sources Interference 1 and Interference 2 and RT₆₀=0.4 s. FIG. 5F illustrates a case where there are two interference speech sources Interference 1 and Interference 2 and RT₆₀=0.6 s. FIG. 5G illustrates a case where there are three interference speech sources Interference 1, Interference 2, and Interference 3 and RT₆₀=0.2 s. FIG. 5H illustrates a case where there are three interference speech sources Interference 1, Interference 2, and Interference 3 and RT₆₀=0.4 s. FIG. 5I illustrates a case where there are three interference speech sources Interference 1, Interference 2, and Interference 3 and RT₆₀=0.6 s. In each graph, the horizontal axis denotes the input SNR (dB), and the vertical axis denotes the word accuracy (%).

It can be easily understood from FIGS. 5A to 5I that the accuracy of the method according to the present invention is higher than those of the methods of the related art.

FIGS. 6A to 6I are graphs of results of simulation of the method according to the present invention (referred to as ‘DC ICA’), the first method of the related art (referred to as ‘SBSE’), the second method of the related art (referred to as ‘BSSA’), and the third method of the related art (referred to as ‘RT IVA’) by using various types of noise samples under the simulation environment of FIG. 4. FIG. 6A illustrates a case of subway noise and RT₆₀=0.2 s. FIG. 6B illustrates a case of subway noise and RT₆₀=0.4 s. FIG. 6C illustrates a case of subway noise and RT₆₀=0.6 s. FIG. 6D illustrates a case of car noise and RT₆₀=0.2 s. FIG. 6E illustrates a case of car noise and RT₆₀=0.4 s. FIG. 6F illustrates a case of car noise and RT₆₀=0.6 s. FIG. 6G illustrates a case of exhibition hall noise and RT₆₀=0.2 s. FIG. 6H illustrates a case of exhibition hall noise and RT₆₀=0.4 s. FIG. 6I illustrates a case of exhibition hall noise and RT₆₀=0.6 s. In each graph, the horizontal axis denotes the input SNR (dB), and the vertical axis denotes the word accuracy (%).

It can be easily understood from FIGS. 6A to 6I that the accuracy of the method according to the present invention is higher than those of the methods of the related art with respect to all kinds of noise.

FIGS. 7A and 7B illustrate a subband clique and a harmonic clique as two typical clique cases.

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the appended claims.

A target speech signal extraction method according to the present invention can be used as a pre-processing method of a speech recognition system.

What is claimed is:
 1. A target speech signal extraction method of extracting a target speech signal from input signals input to at least two or more microphones for robust speech recognition, by a processor of a speech recognition apparatus, comprising: (a) initializing a steering vector for a target speech source and an adaptive vector (w₁(k)) for the target speech source, setting a real output channel of the target speech source as an output by the adaptive vector for the target speech source, initializing adaptive vectors (w₂(k), . . . , w_(M)(k)) for a noise and setting a dummy channel as an output by the adaptive vectors for the noise; (b) setting a cost function for minimizing dependency between a real output for the target speech source and a dummy output for the noise using independent component analysis (ICA) or independent vector analysis (IVA); (c) setting an auxiliary function to the cost function, and updating the adaptive vector (w₁(k)) for the target speech source and the adaptive vectors (w₂(k), . . . , w_(M)(k)) for the noise by using the auxiliary function and the steering vector for the target speech source; (d) estimating the target speech signal by using the adaptive vector for the target speech source, thereby extracting the target speech signal from the input signals; and (e) updating the steering vector for the target speech source; wherein the steps (b)˜(e) are performed repeatedly; and wherein the auxiliary function is set by an inequality relation so that the auxiliary function always has a value greater than or equal to that of the cost function.
 2. The target speech signal extraction method according to claim 1, wherein the step (e) includes: (e1) estimating a target mask which is defined as a value representing a ratio of a target speech signal power to a sum of the target speech signal power and a noise signal power; (e2) estimating a covariance matrix for the target speech source by using the estimated target mask; (e3) obtaining a principal eigenvector by eigenvector decomposition of the estimated covariance matrix; and (e4) estimating the steering vector from the principal eigenvector to update the steering vector.
 3. The target speech signal extraction method according to claim 1, wherein the step (b) sets a condition in which a value of multiplying the steering vector and an input signal of a microphone becomes 1 in an output of the target speech and becomes 0 in the dummy output, and reflects the condition in the auxiliary function.
 4. The target speech signal extraction method according to claim 1, wherein a probability density function of the cost function is modeled by a generalized Gaussian distribution.
 5. The target speech signal extraction method according to claim 4, wherein the generalized Gaussian distribution has a varying variance with regard to time-frequency or one of time and frequency, and wherein the step (c) includes updating alternately the varying variance λ, the adaptive vector (w₁(k)) for the target speech source, and the adaptive vectors (w₂(k), . . . , w_(M)(k)) for the noise by using the auxiliary function, and estimating the target speech source by using the updated adaptive vectors.
 6. The target speech signal extraction method according to claim 4, wherein the generalized Gaussian distribution has a constant variance, and wherein the step (d) includes learning the cost function to update the adaptive vector (w₁(k)) and estimating the target speech source by using the updated adaptive vector.
 7. The target speech signal extraction method according to claim 2, wherein the step (e1) includes: applying a minimal distortion principle (MDP) using a target speech element of the diagonal elements in an inverse matrix of a separating matrix to the adaptive vectors for the target speech source and the noise estimated in the step (c); and estimating a power of the target speech signal and the noise signal by using the separating matrix and estimating the target mask by using the estimated power.
 8. The target speech signal extraction method according to claim 1, wherein a time domain waveform y(t) of an estimated target speech signal is expressed by the following Mathematical Formula,

$$y(t) = \sum_{\tau}\sum_{k=1}^{K} Y_1(k,\tau)\, e^{j\omega_k(t-\tau H)},$$

and wherein Y₁(k,τ)=w₁^(H)(k)x(k,τ), w₁(k) denotes an adaptive vector for generating a real output for the target speech source, and k and τ denote a frequency bin number and a frame number, respectively.
 9. A non-transitory computer readable storage media having program instructions that, when executed by a processor of a speech recognition apparatus, cause the processor to perform the target speech signal extraction method according to claim 1.